Introduction to Open Data Science - Course Project


Thoughts about the Course

Looking forward to learning some R. I have a decent bit of experience in programming and data science, but I have done almost nothing in R before, so that should be interesting. To be honest, I’m not 100% sure if I’ll have the time to finish the course but we’ll see. I believe I found the course while browsing WebOodi.

GitHub Repository


Exercise 2 (Regression and model validation)

Exercise 2 analysis

library(ggplot2)
library(GGally)

Let’s read the dataset to start.

2.1 Read the dataset and explore its dimensions and structure

Let’s look at the dimensions.

dim(learning2014)
## [1] 166   7

We can see that the table consists of 166 observations and 7 variables (i.e. 166 rows and 7 columns when viewed as a table).

Now, let’s look at the structure.

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

Both from looking at the data and from the context of the exercise, we can see that each observation encodes a single student: their gender, age, attitude, exam points and the deep (deep approach), stra (strategic approach) and surf (surface approach) variables, which encode the student’s mean answers to sets of questions meant to measure their approaches to learning. Gender is encoded as a string, age and points as integers, and the remaining variables as decimal numbers.

2.2 Examining the data

summary(learning2014)
##       age           gender              points         attitude    
##  Min.   :17.00   Length:166         Min.   : 7.00   Min.   :1.400  
##  1st Qu.:21.00   Class :character   1st Qu.:19.00   1st Qu.:2.600  
##  Median :22.00   Mode  :character   Median :23.00   Median :3.200  
##  Mean   :25.51                      Mean   :22.72   Mean   :3.143  
##  3rd Qu.:27.00                      3rd Qu.:27.75   3rd Qu.:3.700  
##  Max.   :55.00                      Max.   :33.00   Max.   :5.000  
##       deep            surf            stra      
##  Min.   :1.583   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.333   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.667   Median :2.833   Median :3.188  
##  Mean   :3.680   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.083   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.917   Max.   :4.333   Max.   :5.000
female <- sum(learning2014$gender == "F")
male <- sum(learning2014$gender == "M")
print(paste("The dataset contains", toString(male), "male and", toString(female), "female students"))
## [1] "The dataset contains 56 male and 110 female students"

From above we can see the means, medians, ranges (min, max) and quartiles of the variables. For example, the students’ ages range from 17 to 55 with a mean of 25.5 and a median of 22.

110 of the students are female and 56 are male.

Other than age, which skews young, the variables look more or less normally distributed (except gender, of course, which is a categorical variable). Points also has a small peak at the low end (i.e. a small group of students got very few points while the rest look normally distributed around the middle).

The highest correlation (0.43) is between attitude and points. Surf and deep have a moderate negative correlation (-0.32) as well, and there may be slight correlations between surf and stra and between surf and attitude.

We also plot data from male and female students separately.

Doesn’t seem to be any major differences except maybe in attitude.

2.3 Building our regression model

Now, let’s create our regression model. Points will be the dependent variable and we’ll choose attitude (attitude toward statistics), stra (strategic approach score) and surf (surface approach score) as the independent variables. The exercise didn’t seem to specify which ones we should use, so let’s just go with these.

## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

We can see that attitude is the only significant variable in the model (i.e. p < 0.05). The coefficient for attitude is 3.4, i.e. an increase of one in attitude score predicts an increase of 3.4 in exam points. Let’s drop the non-significant variables and fit a new model.

2.4 Summary of the model

## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Attitude remains highly significant. With an estimate of ~3.5, the model predicts a 3.5-point increase in exam points when attitude increases by one. With an intercept of 11.6, this means that for someone with an attitude of 0 the model would predict a points score of 11.6, and for someone with an attitude of 5 it would predict 11.6 + 3.5*5 = 29.1 points.
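As a quick sanity check, the predictions above can be reproduced directly from the printed coefficients (a small sketch; the unrounded estimates from the summary give a value slightly above the rounded 29.1):

```r
# Coefficients copied from the model summary above
intercept <- 11.6372
slope <- 3.5255

# Predicted exam points for a given attitude score
predict_points <- function(attitude) intercept + slope * attitude

predict_points(0)  # intercept alone: 11.6372
predict_points(5)  # 11.6372 + 3.5255 * 5 = 29.2647, i.e. roughly 29.3
```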

An R-squared of 0.19 means that 19% of the variance of the dependent variable (i.e. exam points) can be explained by the independent variable (attitude).

2.5 Diagnostic Plots

Let’s do some plots for further investigation. Let’s start by plotting residuals vs fitted values.
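The code chunk for these diagnostic plots isn’t shown above; a plausible sketch uses base R’s plot method for lm objects (demonstrated here on synthetic data, since learning2014 is loaded elsewhere):

```r
# Sketch: diagnostic plots like the ones discussed below are typically
# produced with plot() on an lm object. The data here is synthetic.
set.seed(1)
d <- data.frame(attitude = runif(100, 1, 5))
d$points <- 11.6 + 3.5 * d$attitude + rnorm(100, sd = 5)

fit <- lm(points ~ attitude, data = d)

pdf(NULL)                      # null device so this runs non-interactively
plot(fit, which = c(1, 2, 5))  # Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage
invisible(dev.off())

# Leverages (hat values) always sum to the number of model parameters
sum(hatvalues(fit))            # = 2 for an intercept plus one predictor
```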

Residual variance seems fairly constant across the fitted values (perhaps slightly smaller at high point totals, though that is hard to tell given the smaller number of observations there), i.e. the assumption of equal variances seems reasonable.

There are a few residuals in the −15 to −20 range, suggesting possible outliers. The Q-Q plot appears quite linear, suggesting that the residuals are normally distributed.

Leverage essentially describes how much impact a data point has on the model. Note that a high-leverage point is not necessarily an outlier; it could fit the model very well but have high leverage simply because it lies far from the other data points. We can in fact see this above: the highest-leverage point (approx. 0.04) has a fairly small residual.

Regardless, none of the leverages are particularly large and even the relatively larger ones don’t appear to have particularly large residuals.

In summary, the model assumptions appear valid.


Exercise 3 (Logistic regression)

Exercise 3 analysis

3.2 Read Data

Start by reading the dataset and looking up the variable names

library(ggplot2)

alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", header = TRUE, sep = ",")
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset describes students in two Portuguese schools. Each observation describes a single student, including their age, gender and performance at school. For this exercise we’ll mainly be looking at alcohol consumption.

3.4 Hypotheses about alcohol use

Specifically, we’ll examine the relationship between high alcohol consumption and final student grades (G3), family relationship (famrel), absences, and age.

Intuitively we would expect high alcohol use to correlate with worse grades, possibly worse family relationships, more absences and higher age.

Let’s look at grades first.

3.5 Exploration

ggplot(data = alc, aes(x = G3, fill = high_use)) + geom_bar()

table(highuse = alc$high_use, grades = alc$G3)
##        grades
## highuse  0  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##   FALSE 33  0  3 13  5 15 18 34 31 19 13 24 28 13  4 11  5  1
##   TRUE   6  1  4  2  2 16  9 22 12 11 12  3  4  4  2  2  0  0
ggplot(alc, aes(x = high_use, y = G3)) + geom_boxplot()

Students with the highest grades seem to mostly be in the lower use category.

Let’s look at family relationships next.

ggplot(data = alc, aes(x = famrel, fill = high_use)) + geom_bar()

table(highuse = alc$high_use, family = alc$famrel)
##        family
## highuse   1   2   3   4   5
##   FALSE   7   9  40 131  83
##   TRUE    2   9  26  52  23
for (i in sort(unique(alc$famrel))) {
  sub = alc[alc$famrel==i,]
  high = sum(sub$high_use==TRUE)
  low = sum(sub$high_use==FALSE)
  ratio = round(high/(low + high)*100, digits = 2)
  print(paste(toString(ratio), "% of students with a family relationship of", toString(i), "are in the high alcohol category."))

}
## [1] "22.22 % of students with a family relationship of 1 are in the high alcohol category."
## [1] "50 % of students with a family relationship of 2 are in the high alcohol category."
## [1] "39.39 % of students with a family relationship of 3 are in the high alcohol category."
## [1] "28.42 % of students with a family relationship of 4 are in the high alcohol category."
## [1] "21.7 % of students with a family relationship of 5 are in the high alcohol category."

It certainly looks like a good family relationship predicts lower alcohol use (at least when famrel >= 2; the relatively low number of samples in the famrel == 1 category might explain its low ratio of high alcohol consumption).
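The per-group percentages computed with the loop above can also be obtained in one line with prop.table (a sketch on a toy data frame, since alc is loaded elsewhere):

```r
# prop.table with margin = 1 gives row-wise proportions, i.e. the share of
# high-use students within each famrel group. Toy data for illustration.
toy <- data.frame(
  famrel   = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
  high_use = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)
pct_high <- round(prop.table(table(toy$famrel, toy$high_use), margin = 1) * 100, 2)
pct_high[, "TRUE"]  # % of high-use students within each famrel group
```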

Then absences.

ggplot(data = alc, aes(x = absences, fill = high_use)) + geom_histogram(bins = 20)

ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot()

Hmm… A bit hard to make out, but it looks like those with more absences are more likely to be in the high use category.

Finally, let’s look at age.

ggplot(data = alc, aes(x = age, fill = high_use)) + geom_bar()

table(highuse = alc$high_use, age = alc$age)
##        age
## highuse 15 16 17 18 19 20 22
##   FALSE 63 79 65 55  7  1  0
##   TRUE  18 28 35 26  4  0  1
for (i in sort(unique(alc$age))) {
  sub = alc[alc$age==i,]
  high = sum(sub$high_use==TRUE)
  low = sum(sub$high_use==FALSE)
  ratio = round(high/(low + high)*100, digits = 2)
  print(paste(toString(ratio), "% of students of the age of", toString(i), "are in the high alcohol category."))

}
## [1] "22.22 % of students of the age of 15 are in the high alcohol category."
## [1] "26.17 % of students of the age of 16 are in the high alcohol category."
## [1] "35 % of students of the age of 17 are in the high alcohol category."
## [1] "32.1 % of students of the age of 18 are in the high alcohol category."
## [1] "36.36 % of students of the age of 19 are in the high alcohol category."
## [1] "0 % of students of the age of 20 are in the high alcohol category."
## [1] "100 % of students of the age of 22 are in the high alcohol category."

It looks like alcohol consumption jumps up somewhat at age 17 and above.

3.6 Logistic Regression

logreg <- glm(high_use ~ G3 + famrel + absences + age, data = alc, family = "binomial")
summary(logreg)
## 
## Call:
## glm(formula = high_use ~ G3 + famrel + absences + age, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7108  -0.8076  -0.6780   1.2143   1.9825  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.645322   1.795231  -1.474 0.140609    
## G3          -0.009729   0.025802  -0.377 0.706121    
## famrel      -0.277364   0.122979  -2.255 0.024110 *  
## absences     0.062522   0.017618   3.549 0.000387 ***
## age          0.155649   0.102233   1.522 0.127887    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 436.02  on 377  degrees of freedom
## AIC: 446.02
## 
## Number of Fisher Scoring iterations: 4

Both family relationship and absences have significant effects. Neither age nor grades has a significant effect.

Let’s look at the odds ratios and their 95% confidence intervals.

exp(cbind(coef(logreg), confint(logreg)))
## Waiting for profiling to be done...
##                              2.5 %    97.5 %
## (Intercept) 0.07098251 0.002037894 2.3734971
## G3          0.99031793 0.941826103 1.0423755
## famrel      0.75777871 0.594401560 0.9644452
## absences    1.06451794 1.030446671 1.1039063
## age         1.16841640 0.956650325 1.4300500

An increase of one in absences multiplies the odds of a student being in the high alcohol use category by approximately 1.06 (confidence interval boundaries of 1.03 and 1.10). While this might seem small, a student can have many absences, so the effect can accumulate into something relatively large. For example, for a student with 20 absences, the model predicts odds of being in the high alcohol category roughly 3.5 times those of a student with no absences.
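Because odds ratios multiply, compounding the per-absence odds ratio over 20 absences is just raising it to the 20th power (using the coefficient estimate from the model output above):

```r
# Per-absence odds ratio from the logistic regression coefficient above
or_absences <- exp(0.062522)  # ~1.0645

# Odds multiplier for a student with 20 absences vs. one with none
or_absences^20                # ~3.49
```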

In contrast, a good family relationship lowers the odds of a student being in the high alcohol category. For example, according to the model the odds of being in the high alcohol category are approximately 0.76 times as high for a student with a family relationship of 5 as for a student with a family relationship of 4. The upper boundary of the confidence interval is below 1 as well, giving us some confidence that the effect is real.

In summary, the model supports our hypotheses about absences and family relationships, but not grades or age.

3.7 Predictive power of the model

Let’s explore the predictive power of the model

prob <- predict(logreg, type = "response")

library(dplyr)
alc <- mutate(alc, probability = prob)
alc <- mutate(alc, prediction = prob > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   260   10
##    TRUE     98   14

The training error is (98 + 10)/382 ≈ 0.283, with most errors being false negatives. This is certainly better than guessing at 50% probability, but given that only about 30% of students are in the high alcohol use category in the first place, it’s actually not all that much better than simply assuming every student is in the low-consumption group.
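The comparison against the trivial always-FALSE classifier can be made explicit from the confusion matrix above:

```r
# Confusion matrix copied from the output above (rows: true high_use,
# columns: model prediction)
cm <- matrix(c(260, 98, 10, 14), nrow = 2,
             dimnames = list(high_use  = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

n <- sum(cm)                                                    # 382 students
model_error    <- (cm["FALSE", "TRUE"] + cm["TRUE", "FALSE"]) / n  # ~0.283
baseline_error <- sum(cm["TRUE", ]) / n                            # ~0.293: always guess FALSE

c(model = model_error, baseline = baseline_error)
```

The gap between ~0.283 and ~0.293 is what the paragraph above means by "not all that much better".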


Exercise 4 (Clustering and classification)

Exercise 4 analysis

Let’s start by loading the data and looking at its structure.

4.2 Loading and exploring the dataset

library(MASS)

data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset concerns housing values in different towns of the Boston metropolitan area. It has 506 observations of 14 variables, including the median value of owner-occupied homes in the town (medv), the per capita crime rate of the town (crim), and the proportion of non-retail business acres per town (indus). Descriptions of all the variables can be found in the dataset’s documentation.

4.3 Graphical overview and summary of the data

Let’s plot a correlation matrix to visualize possible correlations between variables.

#load the libraries
library(corrplot)
library(tidyverse)

#generate and plot correlation matrices
corr_matrix <- cor(Boston)
corr_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
corrplot(corr_matrix, method = "circle", type = "upper", cl.pos = "b", order = "hclust", tl.pos = "d", tl.cex = 0.6)

We can see some strong positive correlations including tax-rad, tax-indus, nox-indus, and nox-age. Strong negative correlations include dis-nox, dis-indus, dis-age, and medv-lstat.

For example, if we look at nox (nitrogen oxides concentration) and dis (distance to Boston employment centres), we can see a clear relationship.

ggplot(data = Boston, aes(x = nox, y = dis)) + geom_point()

Now, let’s look at the summary.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We can see that there’s quite a bit of range in many of the variables. The median value of owner-occupied homes, for example, varies from $5000 to $50000 (we can tell the dataset is rather old considering how low the prices are).

4.4 Standardize the dataset

Let’s standardize the dataset with the scale function

scaled_data <- scale(Boston)
scaled_data <- as.data.frame(scaled_data)

summary(scaled_data)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

What we’ve done is scale each variable by subtracting its mean from each value and dividing by its standard deviation. This sets the mean of each variable to 0, and each scaled value tells us how many standard deviations the original value was from the mean.
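On a toy vector, this is exactly what scale() computes:

```r
# Standardization by hand: subtract the mean, divide by the standard deviation
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)

all.equal(z, as.numeric(scale(x)))  # TRUE: scale() does the same operation
mean(z)                             # 0
sd(z)                               # 1
```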

Now, let’s create a categorical variable of the crime rate.

bins <- quantile(scaled_data$crim)
crime <- cut(scaled_data$crim, breaks = bins, labels = c("low","med_low","med_high","high"), include.lowest = TRUE)

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
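The near-equal class sizes above are exactly what quantile breaks produce; here is the same cut() pattern on a toy vector:

```r
# cut() with quantile() breaks puts roughly a quarter of the values in each
# bin, which is why the four crime classes above are almost equal in size.
x <- 1:100
bins <- quantile(x)
cls <- cut(x, breaks = bins,
           labels = c("low", "med_low", "med_high", "high"),
           include.lowest = TRUE)
table(cls)  # 25 values per class
```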

Now, let’s replace the old crim variable with the new one.

scaled_data <- dplyr::select(scaled_data, -crim)
scaled_data <- data.frame(scaled_data, crime)

And now let’s split the data: 80% of the rows form the training set. (Note that the code below keeps the full dataset as the test set rather than the remaining 20%; a disjoint hold-out set would be scaled_data[-ind,].)

ind <- sample(nrow(scaled_data), size = nrow(scaled_data)*0.8)
train <- scaled_data[ind,]
test <- scaled_data

4.5 Fit the LDA

Let’s fit a linear discriminant analysis (LDA) model on the training set, with our new categorical crime variable as the target.

lda_fit <- lda(crime ~ ., data = train)

Now let’s visualize the LDA with a biplot.

lda.arrows <- function(x, myscale = 1, arrow.heads = 0.1, color = 'purple', tex = 0.75, choices = c(1,2)) {
  heads = coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale*heads[,choices[1]],
         y1 = myscale*heads[,choices[2]],
         col = color, length = arrow.heads)
  text(myscale*heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
classes = as.numeric(train$crime)

plot(lda_fit, dimen = 2, col = classes)
lda.arrows(lda_fit, myscale = 2)

4.6 Predicting the classes

First, let’s save and remove the crime variable from the test set.

correct <- test$crime
test <- dplyr::select(test, -crime)

Now, let’s predict

predictions <- predict(lda_fit, newdata = test)
t <- table(correct = correct, predicted = predictions$class)
t
##           predicted
## correct    low med_low med_high high
##   low       75      44        8    0
##   med_low   18      88       20    0
##   med_high   1      23       96    6
##   high       0       0        1  126

The high class is predicted very well, but low is quite often mistaken for med_low, and med_high for med_low. Still, the model seems quite reasonable; almost all of the errors are only one category away from the true class.
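The overall accuracy implied by the cross-tabulation above can be read off its diagonal:

```r
# Cross-tabulation copied from the output above (rows: correct class,
# columns: predicted class)
conf_mat <- matrix(c(75, 18,  1,   0,
                     44, 88, 23,   0,
                      8, 20, 96,   1,
                      0,  0,  6, 126), nrow = 4,
                   dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                                   predicted = c("low", "med_low", "med_high", "high")))

sum(diag(conf_mat)) / sum(conf_mat)  # ~0.76 of observations classified correctly
```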

4.7 K-means algorithm

Let’s reload the dataset, standardize it and calculate the distances between observations.

data(Boston)
scaled_data <- scale(Boston)
dist <- dist(scaled_data)

Let’s first just run a k-means algorithm with 3 clusters.

km <- kmeans(scaled_data, centers = 3)
pairs(scaled_data, col = km$cluster)

Hmm… can’t make too much sense out of the clusters with this visualization. Let’s get back to this later.

Now, let’s try to figure out how many clusters we should optimally use by looking at how WCSS (within cluster sum of squares) behaves when we change the number of clusters.

library(ggplot2)

set.seed(123)

k_max <- 20
twcss <- sapply(1:k_max, function(k){kmeans(scaled_data, k)$tot.withinss})

qplot(x = 1:k_max, y = twcss, geom = 'line')

Based on the big drop around 2, let’s use 2 as our number of clusters.

Now let’s run our k-means algorithm.

km <- kmeans(scaled_data, centers = 2)
scaled_data <- as.data.frame(cbind(scaled_data, km$cluster))

Now let’s visualize the clusters.

pairs(scaled_data, col = km$cluster)

It’s a little hard to make out. Let’s try another way to visualize the clusters.

library(GGally)

scaled_data <- data.frame(scaled_data)
ggpairs(scaled_data, aes(col = as.factor(km$cluster), alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

The clusters appear rather distinct now. The blue cluster seems to include mostly towns that have more industry, better highway access, higher taxes, worse air quality (nox), a larger lower-status population, shorter distances to the Boston employment centres, and lower-valued homes.


Exercise 5 (Dimensionality reduction techniques)

Exercise 5 analysis

5.1 Graphical overview

Let’s start by loading the data and looking at its structure.

human <- read.csv('./data/human.csv', row.names = 1)

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The data contains 155 rows with 8 variables. Each row describes a single country’s life expectancy (Life.Exp), maternal mortality rate (Mat.Mor), expected years of schooling (Edu.Exp), gross national income per capita (GNI), adolescent birth rate (Ado.Birth), proportion of women in parliament (Parli.F), female/male ratio in the labour force (Labo.FM) and female/male ratio of secondary education attainment (Edu2.FM).

Next, we’ll print out the summaries of the variables.

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Let’s look at some visualizations.

library(GGally)
library(corrplot)
library(dplyr)

ggpairs(human)

cor(human) %>% corrplot(method = "circle", type = "upper", cl.pos = "b", order = "hclust", tl.pos = "d", tl.cex = 0.6)

In terms of distributions, it’s clear that many of the variables do not follow a normal distribution. For example, GNI is heavily right-skewed: most countries have a low GNI, with a long tail of high values.

In terms of relationships, we can see correlations between many of the variables. For example, maternal mortality has a negative correlation with Life.Exp, Edu.Exp and Edu2.FM and a positive correlation with Ado.Birth. Life expectancy, on the other hand, has a positive correlation with Edu.Exp, GNI and Edu2.FM.

5.2 Principal component analysis

Let’s perform a PCA on the non-standardized dataset and do a bi-plot for the first two principal components.

pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

We can see that effectively all (99.99%) of the variance in the data is explained by the PC1 component (which seems to be mainly driven by GNI). This is not very useful. Let’s see if we can do better by standardizing the data.
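To see where the 99.99% figure comes from: each component’s proportion of variance is its squared standard deviation divided by the total, which we can reproduce from the standard deviations printed above:

```r
# Standard deviations copied from the PCA summary above
sds <- c(1.854e4, 185.5219, 25.19, 11.45, 3.766, 1.566, 0.1912, 0.1591)

# Proportion of variance = variance of each PC / total variance
props <- sds^2 / sum(sds^2)
round(props, 4)  # PC1 alone accounts for essentially all of the variance
```

GNI's huge scale (tens of thousands vs. single digits for the other variables) dominates the unstandardized covariance, which is why PC1 is driven by it.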

5.3-5.4 PCA after standardization of the variables

First, let’s standardize all the variables.

human_scaled <- scale(human)

Now let’s redo the PCA and the biplot.

pca_human_std <- prcomp(human_scaled)
  
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

When looking at the loadings of the first two principal components, we can see that higher Edu.Exp, GNI, Edu2.FM and Life.Exp pull PC1 to the left, whereas higher Mat.Mor and Ado.Birth pull it to the right. PC2 is driven mainly by Parl.F and Labo.FM.

Based on this, richer and more developed countries (those with higher GNI, education and life expectancy) tend to sit on the left and poorer ones on the right. Countries with higher female representation in parliament and/or participation in the labour force trend upwards, and countries with lower female labour participation and parliamentary representation trend downwards. For example, Iceland, which is both highly developed and has high female participation in the workforce and parliament, sits in the upper left, whereas Qatar, which is rich but has low female representation in parliament and the workforce, sits in the lower right.
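To double-check a biplot reading like this, one can inspect the loadings directly. A self-contained sketch with the built-in USArrests data (the human object is not reloaded here):

```r
# Sketch with the built-in USArrests data (not the human dataset): the
# rotation matrix holds the loadings that the biplot draws as arrows.
pca <- prcomp(USArrests, scale. = TRUE)
round(pca$rotation[, 1:2], 2)
# Variables with same-signed PC1 loadings point the same way in the biplot.
```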

5.5 Tea dataset

Let’s load and look at the dataset

library(FactoMineR)
library(tidyr)

data(tea)

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

There’s a lot of stuff here, so let’s keep only the variables we are interested in (the same ones as in the DataCamp exercise).

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_data <- dplyr::select(tea, one_of(keep_columns))

dim(tea_data)
## [1] 300   6
str(tea_data)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

The dataset describes participants’ answers to a questionnaire on tea drinking. After the selection there are 300 rows and 6 variables. Next, we’ll conduct and visualize the MCA (multiple correspondence analysis).
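Under the hood, MCA works on the complete disjunctive (indicator) coding of the factors. A rough base-R sketch of that coding, using a toy stand-in for tea_data:

```r
# Sketch (base R, toy data standing in for tea_data): MCA operates on the
# indicator coding of each factor -- one 0/1 column per category.
toy <- data.frame(Tea = factor(c("black", "Earl Grey", "green", "black")))
ind <- model.matrix(~ 0 + Tea, toy)  # "~ 0 +" drops the intercept
print(ind)
# Each row sums to 1: every observation is in exactly one category.
```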

mca <- MCA(tea_data, graph = FALSE)

summary(mca)
## 
## Call:
## MCA(X = tea_data, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")

Color indicates which variable a category belongs to, and the distance between points measures their similarity. For example, those who buy/use unpackaged tea also tend to buy from tea shops, whereas those who use/buy tea bags tend to buy from chain stores.


Exercise 6 (Analysis of longitudinal data)

Exercise 6 analysis

6.1 RATS analysis

Let’s first read the dataset (and factor the categorical variables)

RATSL <- read.csv("data/ratsL.csv")
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

The dataset comes from a nutrition study on three groups of rats. The groups were put on different diets, and each rat’s weight was measured repeatedly over a 9-week period.

Let’s examine the data through some visualizations.

p1 <- ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=2))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
p6

We can see that the rats in Group 1 start (and end up) much lighter than those in Groups 2 and 3. Most of the rats appear to gain weight over time.

One rat in Group 2 looks like a bit of an outlier, being much heavier than the others.

Let’s standardize the weights (grouped by group and time) and look again.

RATSL <- RATSL %>%
  group_by(Time, Group) %>%
  mutate( stdweight = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()


ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=2)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

Hmmm… after standardization there don’t appear to be any major shifts between the rats.
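For reference, the group-wise standardization done above with dplyr can be sketched in base R with ave() (toy vectors, not the RATS data):

```r
# Base-R sketch of group-wise standardization (toy data, not RATSL):
w    <- c(240, 250, 260, 440, 450, 460)
grp  <- factor(c(1, 1, 1, 2, 2, 2))
stdw <- ave(w, grp, FUN = function(x) (x - mean(x)) / sd(x))
stdw  # -1 0 1 -1 0 1: each group now has mean 0 and sd 1
```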

Now let’s plot the mean weight development for the three groups of rats.

RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean=mean(Weight), se=sd(Weight)/sqrt(length(Weight)) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
p1 <- ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group))
p2 <- p1 + geom_line() + scale_linetype_manual(values = c(1,2,3))
p3 <- p2 + geom_point(size=3) + scale_shape_manual(values = c(1,2,3))
p4 <- p3 + geom_errorbar(aes(ymin=mean-se, ymax=mean+se), linetype=1, width=0.3)
p5 <- p4 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6 <- p5 + theme(legend.position = c(0.8,0.8))
p7 <- p6 + scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
p7

On average, all three groups appear to gain a little weight, Groups 2 and 3 perhaps more so than Group 1.

Let’s do some boxplots.

p1 <- ggplot(RATSL, aes(x = factor(Time), y = Weight, fill = Group))
p2 <- p1 + geom_boxplot(position = position_dodge(width = 0.9))
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + theme(legend.position = c(0.8,0.8))
p5 <- p4 + scale_x_discrete(name = "Day")
p5

RATSS2 <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()

glimpse(RATSS2)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9...
ggplot(RATSS2, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 8-64")

It was a bit hard to make out before, but here we can see that rats in Group 3 are in fact on average a bit heavier than those in Group 2 (Group 1 is of course by far the lightest). We can also see a big outlier in Group 2 (almost certainly the heavy rat we saw before). Let’s remove it.

RATSL_Filt <- RATSS2 %>%
  filter(mean < 580) 


ggplot(RATSL_Filt, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 8-64")

Gone!

Alright, now let’s do some testing with ANOVA.

#read the original data and remove the outlier rat
RATS_o <-  read.csv("data/rats.csv")
RATS_o <- RATS_o %>%
  filter(ID != 12) 

#now let's add the first day values as a baseline variable
RATSL_Filt2 <- RATSL_Filt %>%
  mutate(baseline = RATS_o$WD1) 
fit <- lm(mean ~ baseline + Group, data = RATSL_Filt2)
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 207817  207817 2714.9357 1.601e-14 ***
## Group      2   1483     742    9.6878  0.003748 ** 
## Residuals 11    842      77                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hmm… we can see from the ANOVA that, after controlling for the baseline weight, there are indeed statistically significant differences between the groups (p ≈ 0.004).
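The structure of this test can be sketched on toy data (not the rat study): putting the baseline covariate before the group factor in the model means the group effect is tested on baseline-adjusted outcomes.

```r
# Toy sketch of baseline-adjusted ANOVA (ANCOVA); not the actual rat data.
set.seed(42)
group    <- factor(rep(1:3, each = 10))
baseline <- rnorm(30, mean = 250, sd = 20)
outcome  <- baseline + c(0, 10, 20)[group] + rnorm(30, sd = 5)
fit <- lm(outcome ~ baseline + group)
anova(fit)  # baseline absorbs individual variation before group is tested
```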

6.2 BPRS analysis

Let’s load up the dataset (and factor the categorical variables) and take a look.

BPRSL <- read.csv("./data/bprsL.csv")

BPRSL$subject <- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)

glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

The data describe 40 male subjects who were assigned to two treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) weekly for 8 weeks (plus a baseline measurement at the start). The BPRS measures symptoms such as hostility, hallucinations and suspiciousness.
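As a reminder of where the long format comes from, here is a base-R sketch of the wide-to-long conversion (toy table; the actual wrangling was done with tidyr in a separate script):

```r
# Sketch (toy data): converting a wide BPRS-style table (one column per
# week) to long format with base R's reshape().
wide <- data.frame(subject = 1:2, week0 = c(42, 58), week1 = c(40, 55))
long <- reshape(wide, direction = "long",
                varying = c("week0", "week1"), v.names = "bprs",
                timevar = "week", times = 0:1, idvar = "subject")
long[order(long$subject), ]  # one row per subject-week pair
```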

Let’s do some plotting.

p1 <- ggplot(BPRSL, aes(x = week, y = bprs, group = subject))
p2 <- p1 + geom_text(aes(label = treatment), color = c("blue", 'red')[BPRSL$treatment])
p3 <- p2 + scale_x_continuous(name = "Week", breaks = seq(0, 8, 2))
p4 <- p3 + scale_y_continuous(name = "BPRS")
p5 <- p4 + theme_bw()
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

This is a bit hard to make out; there doesn’t seem to be any obvious difference between the groups. Treatment 2 has the top value every week, but this could be due to a single participant with particularly high bprs values.

Let’s check individual response profiles.

p1 <- ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=2))
p3 <- p2 + facet_grid(. ~ treatment, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
p6

It looks like the bprs scores tend to go down over time. It is hard to see any difference between the groups (other than one participant in treatment group 2 having rather high scores compared to the others).

Let’s fit a simple linear regression model with bprs as response variable and week and treatment as explanatory variables (ignoring the repeated-measures structure for now).

BPRSL_Reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRSL_Reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

We can see that week has a significant effect, but treatment doesn’t. Now let’s fit a random intercept model with subject as random effect.
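Before fitting the mixed model, the intuition can be sketched in base R (toy data; this is a fixed-effects approximation, not lmer itself): giving each subject their own intercept soaks up between-subject level differences before the slope is estimated.

```r
# Toy sketch of the random-intercept idea using per-subject fixed effects
# (an approximation -- the actual analysis uses lme4::lmer).
set.seed(7)
subj <- factor(rep(1:5, each = 4))
week <- rep(0:3, times = 5)
y    <- 50 + rnorm(5, sd = 6)[subj] - 2 * week + rnorm(20, sd = 3)
fit  <- lm(y ~ week + subj)
coef(fit)["week"]  # slope estimate after per-subject intercepts
```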

library("lme4")
library("afex")

BPRSL_Ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_Ref)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     1.9090  37.2392  24.334   <2e-16 ***
## week         -2.2704     0.2084 340.0000 -10.896   <2e-16 ***
## treatment2    0.5722     1.0761 340.0000   0.532    0.595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

We can see that, again, only week has a significant effect (bprs scores drop as the weeks go by).

Now, let’s also fit a random intercept and random slope model.

BPRSL_Ref1 <- lmer(bprs ~ week + treatment + (treatment | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_Ref1)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (treatment | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2584.7   2611.9  -1285.4   2570.7      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3048 -0.6022 -0.0652  0.4414  3.1690 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept)  64.42    8.026        
##           treatment2  188.83   13.742   -0.56
##  Residual              54.23    7.364        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     1.9708  24.3001  23.571   <2e-16 ***
## week         -2.2704     0.1503 320.0001 -15.104   <2e-16 ***
## treatment2    0.5722     3.1692  19.9999   0.181    0.859    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.305       
## treatment2 -0.540  0.000

Same as before.

And finally a random intercept and random slope model with an interaction between treatment and week.

BPRSL_Ref2 <- lmer(bprs ~ week * treatment + (treatment | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_Ref2)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week * treatment + (treatment | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2581.0   2612.1  -1282.5   2565.0      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2782 -0.6037 -0.0797  0.4437  3.3942 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept)  64.53    8.033        
##           treatment2  189.04   13.749   -0.56
##  Residual              53.27    7.298        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)      47.8856     2.0573  28.8053  23.275   <2e-16 ***
## week             -2.6283     0.2107 319.9999 -12.475   <2e-16 ***
## treatment2       -2.2911     3.3859  26.0248  -0.677   0.5046    
## week:treatment2   0.7158     0.2980 319.9999   2.402   0.0169 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.410              
## treatment2  -0.586  0.249       
## wek:trtmnt2  0.290 -0.707 -0.352

This model suggests there is a significant positive interaction between week and treatment (p ≈ 0.017), in addition to the negative main effect of week.
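To make the interaction concrete, the implied weekly slope for treatment group 2 is the week coefficient plus the interaction term (values copied from the summary above):

```r
# Implied slopes from the fixed effects printed above:
week_effect      <- -2.6283   # slope for treatment group 1
week_interaction <-  0.7158   # extra slope for treatment group 2
slope_t2 <- week_effect + week_interaction
slope_t2  # -1.9125: group 2's bprs declines more slowly than group 1's
```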

Now, let’s compare the models.

anova(BPRSL_Ref, BPRSL_Ref1)
## Data: BPRSL
## Models:
## BPRSL_Ref: bprs ~ week + treatment + (1 | subject)
## BPRSL_Ref1: bprs ~ week + treatment + (treatment | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRSL_Ref     5 2748.7 2768.1 -1369.4   2738.7                         
## BPRSL_Ref1    7 2584.7 2611.9 -1285.4   2570.7 167.97  2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(BPRSL_Ref1, BPRSL_Ref2)
## Data: BPRSL
## Models:
## BPRSL_Ref1: bprs ~ week + treatment + (treatment | subject)
## BPRSL_Ref2: bprs ~ week * treatment + (treatment | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRSL_Ref1    7 2584.7 2611.9 -1285.4   2570.7                       
## BPRSL_Ref2    8 2581.0 2612.1 -1282.5   2565.0 5.7203  1    0.01677 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(BPRSL_Ref,BPRSL_Ref2)
## Data: BPRSL
## Models:
## BPRSL_Ref: bprs ~ week + treatment + (1 | subject)
## BPRSL_Ref2: bprs ~ week * treatment + (treatment | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRSL_Ref     5 2748.7 2768.1 -1369.4   2738.7                         
## BPRSL_Ref2    8 2581.0 2612.1 -1282.5   2565.0 173.69  3  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on both the likelihood-ratio tests and the AIC values, BPRSL_Ref2 (the model with the interaction) appears to perform best, although the improvement over BPRSL_Ref1 is modest.
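For reference, the likelihood-ratio test that anova() reports can be reproduced by hand from the deviances printed above:

```r
# Reproducing the BPRSL_Ref1 vs BPRSL_Ref2 likelihood-ratio test from the
# deviances shown above (difference of deviances ~ chi-square).
dev_ref1 <- 2570.7
dev_ref2 <- 2565.0
chisq <- dev_ref1 - dev_ref2                    # 5.7
p <- pchisq(chisq, df = 1, lower.tail = FALSE)  # df = one extra parameter
round(p, 3)  # ~0.017, matching the anova() output up to rounding
```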

Let’s also look at the fitted values and how they look against the observed values (which we already plotted before).